STAT 679: Problem Set #4
Q1. Political Book Recommendations
In this problem, we’ll study a network data set of bestselling US politics books on Amazon. Books are linked by an edge if they appeared together in the recommendations (“customers who bought this book also bought these other books”).
Part (a)
The code below reads in the edges and nodes associated with the network. The edges data set only contains IDs of co-recommended books, while the nodes data includes attributes associated with each book. Build a tbl_graph object to store the graph.
edge_data_path <- "../data/political-books-edges.csv"
node_data_path <- "../data/political-books-nodes.csv"
edges <- read_csv(edge_data_path, col_types = "cci")
nodes <- read_csv(node_data_path, col_types = "ccc")
books_graph <- tbl_graph(nodes, edges, directed = FALSE)
books_graph
## # A tbl_graph: 105 nodes and 441 edges
## #
## # An undirected simple graph with 1 component
## #
## # A tibble: 105 × 3
## id label political_ideology
## <chr> <chr> <chr>
## 1 0 1000 Years for Revenge neutral
## 2 1 Bush vs. the Beltway conservative
## 3 2 Charlie Wilson's War conservative
## 4 3 Losing Bin Laden conservative
## 5 4 Sleeping With the Devil neutral
## 6 5 The Man Who Warned America conservative
## # ℹ 99 more rows
## #
## # A tibble: 441 × 3
## from to weight
## <int> <int> <int>
## 1 1 2 1
## 2 1 3 1
## 3 1 4 1
## # ℹ 438 more rows
Part (b)
Use the result from part (a) to visualize the network as a node-link diagram. Include the book titles as node labels, and shade each node according to political ideology.
ggraph(books_graph, layout = "kk") +
  geom_edge_link(width = 0.05) +
  geom_node_point(aes(col = political_ideology), size = 3) +
  geom_node_label(aes(label = label), repel = TRUE, label.size = 0.1) +
  labs(col = "Political Ideology")
Part (c)
Create the analogous adjacency matrix visualization. Provide examples of visual queries that are easy to answer using one encoding but not the other (i.e., what is easy to see in the node-link view vs. what is easy to see in the adjacency matrix).
ggraph(books_graph, layout = "matrix", sort.by = political_ideology) +
  geom_edge_tile(show.legend = FALSE, mirror = TRUE) +
  geom_node_text(aes(label = label, col = political_ideology), x = -1, size = 1, hjust = 1) +
  geom_node_text(aes(label = label, col = political_ideology), y = -1, size = 1, hjust = 1) +
  labs(col = "Political Ideology")
In crowded graphs, it is easier to read off the degree of a given node from an adjacency matrix than from a node-link diagram. Conversely, it is easier to trace paths and circuits in a node-link diagram than in an adjacency matrix.
Q2. Topics in Pride and Prejudice
This problem uses LDA to analyze the full text of Pride and Prejudice. The object paragraph is a data.frame whose rows are paragraphs from the book. We have filtered out very short paragraphs, e.g., from dialogue. We are interested in, for example, how the topics appearing in the book vary from the start to the end of the book.
## # A tibble: 1,092 × 2
## text paragraph
## <chr> <dbl>
## 1 "however little known the feelings or views of such a man may be o… 1
## 2 "\"why, my dear, you must know, mrs. long says that netherfield is… 2
## 3 "\"i see no occasion for that. you and the girls may go, or you ma… 3
## 4 "\"my dear, you flatter me. i certainly _have_ had my share of bea… 4
## 5 "\"but consider your daughters. only think what an establishment i… 5
## 6 "\"you are over-scrupulous, surely. i dare say mr. bingley will be… 6
## 7 "\"i desire you will do no such thing. lizzy is not a bit better t… 7
## 8 "mr. bennet was so odd a mixture of quick parts, sarcastic humour,… 8
## 9 "mr. bennet was among the earliest of those who waited on mr. bing… 9
## 10 "\"i honour your circumspection. a fortnight's acquaintance is cer… 10
## # ℹ 1,082 more rows
Part (a)
Create a Document-Term Matrix containing word counts from across the same paragraphs. That is, the i-th row of dtm should correspond to the i-th row of paragraph. Make sure to remove all stop-words.
by_paragraph <- paragraphs %>%
  unite(document, paragraph)
word_counts <- by_paragraph %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(document, word) %>%
  arrange(-n)
dtm <- word_counts %>%
  cast_dtm(document, word, n)
dtm
## <<DocumentTermMatrix (documents: 1092, terms: 5655)>>
## Non-/sparse entries: 30289/6144971
## Sparsity : 100%
## Maximal term length: 17
## Weighting : term frequency (tf)
Part (b)
Fit an LDA model to dtm using 6 topics. Set the seed by using the argument control = list(seed=479) to remove any randomness in the result.
Part (c)
Visualize the top 30 words within each of the fitted topics. Specifically, create a faceted bar chart where the lengths of the bars correspond to word probabilities and the facets correspond to topics. Reorder the bars so that each topic’s top words are displayed in order of decreasing probability.
top_words <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 30) %>%
  mutate(term = reorder_within(term, beta, topic))
ggplot(top_words) +
  geom_col(aes(beta, term, fill = factor(topic)), show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  labs(title = "Beta matrix of top 30 words in each topic", x = "Probability of words", y = "") +
  scale_fill_brewer(palette = "Set2") +
  scale_y_reordered() +
  scale_x_continuous(expand = c(0, 0))
Part (d)
Find the paragraph that is the purest representative of Topic 2. That is, if \(\gamma_{ik}\) denotes the weight of topic \(k\) in paragraph \(i\), then print out paragraph \(i^{*}\), where \(i^{*} = \arg\max_{i} \gamma_{i2}\). Verify that at least a few of the words with high probability for this topic appear in it. Only copy the first sentence into your solution.
topic2_members <- memberships %>%
  filter(topic == 2)
pure_topic2_doc <- topic2_members %>%
  arrange(-gamma) %>%
  pull(document) %>%
  first()
pure_topic2_para_text <- by_paragraph %>%
  filter(document == pure_topic2_doc) %>%
  pull(text)
paste0("Topic 2 is purely represented by paragraph #", pure_topic2_doc)
## [1] "Topic 2 is purely represented by paragraph #546"
## [1] "\"i had not been long in hertfordshire, before i saw, in common with others, that bingley preferred your elder sister to any other young woman in the country."
Q3. Food nutrients
This problem will use PCA to provide a low-dimensional view of a 14-dimensional nutritional facts data set. The data were originally curated by the USDA and are regularly used in visualization studies.
Part (a)
Define a tidymodels recipe that normalizes all nutrient features and specifies that PCA should be performed.
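A sketch of one possible recipe; the object pca_prep matches the name used in the later parts, while the data frame foods and the identifier columns name and group are assumptions:

```r
library(tidymodels)

# Assumed: the nutrient data live in a data frame `foods`, with
# identifier columns `name` and `group` (hypothetical names) plus
# the 14 numeric nutrient columns.
pca_recipe <- recipe(~., data = foods) %>%
  update_role(name, group, new_role = "id") %>%  # keep IDs out of the PCA
  step_normalize(all_predictors()) %>%           # normalize nutrient features
  step_pca(all_predictors())                     # then perform PCA

pca_prep <- prep(pca_recipe)
```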
Part (b)
Visualize the top 6 principal components. What types of food do you expect to have low or high values for PC1 or PC2?
I expect Fats and Oils to have the lowest value for PC1, and Sweets to have the highest value for PC2.
components <- tidy(pca_prep, 2)
ggplot(components %>% filter(component %in% glue("PC{1:6}"))) +
  geom_col(aes(value, terms), show.legend = FALSE) +
  facet_wrap(~ component) +
  labs(title = "Top 6 principal components of food nutrients", x = "Component Value", y = "")
Part (c)
Compute the average value of PC2 within each category of the group column. Give the names of the groups sorted by this average.
sample_scores <- bake(pca_prep, NULL)
PC2_groups <- sample_scores %>%
  group_by(group) %>%
  summarize(avg_PC2 = mean(PC2)) %>%
  arrange(-avg_PC2) %>%
  pull(group)
The groups, sorted (highest -> lowest) by average value of PC2, are:
Spices and Herbs, Breakfast Cereals, Sweets, Snacks, Cereal Grains and Pasta, Baked Products, Beverages, Fruits and Fruit Juices, Legumes and Legume Products, Baby Foods, Meals, Entrees, and Sidedishes, Vegetables and Vegetable Products, Fast Foods, Dairy and Egg Products, Restaurant Foods, Nut and Seed Products, Soups, Sauces, and Gravies, Ethnic Foods, Finfish and Shellfish Products, Poultry Products, Pork Products, Beef Products, Sausages and Luncheon Meats, Lamb, Veal, and Game Products, Fats and Oils
Part (d)
Visualize the scores of each food item with respect to the first two principal components. Facet the visualization according to the group column, and sort the facets according to the results of part (c). How does the result compare with your guess from part (b)?
- From the plot below, I can confirm that my guess that Fats and Oils would have a low PC1 was accurate.
- Spices and Herbs has a higher value of PC2 than my guess of Sweets.
ggplot(sample_scores) +
  geom_hline(yintercept = 0, size = 0.5, col = "#5d5d5d") +
  geom_vline(xintercept = 0, size = 0.5, col = "#5d5d5d") +
  geom_point(aes(PC1, PC2), size = 0.3, alpha = 0.4) +
  facet_wrap(~ factor(group, levels = PC2_groups)) +
  labs(title = "Scores of food items w.r.t. the first two principal components")
Q4. Interactive Phylogeny
We will build an interactive phylogenetic tree of SARS-CoV-2 genetic sequences. Each sequence has been annotated with the date and location of its discovery. We will use D3 to allow readers to explore the way genetic changes unfold over time and space. You can find the raw data here: nodes, edges.
We have provided starter code to build a d3.stratify() object from the edge data and to define an object, node_lookup, which can be used to look up the country and date associated with the from and to fields in the edges.
Part (a)
Create a static tree visualization that shows how the different COVID variants evolved from one another. Use color to encode the location of the variant’s discovery. You may group rare countries into “Other,” and draw variants with unknown origins using either white or grey.
Code for parts (a) and (b):
Part (b)
Implemented Interactivity: As the user hovers near a node, all of its ancestors are highlighted, and the rest of the nodes blend into the background.
Part (c)
Propose, but do not implement, an extended version of part (b) that is linked with an additional table or visualization. How would the second graphic be updated in response to user interactions? What additional queries become possible in your proposed visualization?
- I would add a grouped bar plot of “months vs. number of variants detected” to interact with the tree.
- Each month would contain a group of bars, with one bar per country, which would be assigned the same colours as seen in the tree.
- When the user clicks on a bar, the nodes corresponding to the bar will be highlighted, and the rest will blend into the background.
Q5. Hierarchical Edge Bundling
In this problem, we will study a D3 hierarchical edge bundling implementation available at this link. The display shows how different files in a software package import from one another. Unlike a naive radial node-link layout, this layout “bundles” together edges if their source and target nodes have common ancestors in the package’s directory tree (which is why the resulting layout is called a “Hierarchical Edge Bundling”).
Part (a)
Use console.log() to inspect the root object. Describe its structure.
The root is a JSON object made up of two items. One is the name attribute, flare. The second item contains the children of this node; it is an array of 10 JSON objects.
Part (b)
What does this line do?
This line calls the line generator line() to connect data items i and o by drawing a path between them.
Provide one example of an edge in the original visualization (e.g., xor <--> or, though this is not a correct answer) where you believe i.path(o) contains more than two elements, and explain your reasoning. You may find it useful to console.log() the result from i.path(o).
Labeler <–> Data.
These two nodes sit in different subtrees of the directory hierarchy, so the path between them must climb through their common ancestors; i.path(o) therefore contains intermediate nodes in addition to i and o.
Part (c)
Imagine that you are working for a biotechnology firm that is interested in visualizing a protein network. You have data on the co-occurrence frequency for all pairs of proteins (high-co-occurrence can be interpreted as the proteins lying on a shared regulatory pathway). What, if any, additional information would you need before you could implement a hierarchical edge bundling visualization of the network? Explain your reasoning.
I might need information on:
- the direction of the co-occurrence relation (if relevant), to differentiate the frequencies accordingly;
- a hierarchy over the proteins (to define parents and children), since edge bundling needs a tree along which to bundle the edges.
Q6. Interpreting UMAP
Imagine that you are a statistical consultant working with a scientist / sports team / journalist / sales division head (pick your favorite or make up your own example). At one point in your study, you found it useful to apply nonlinear dimensionality reduction with UMAP. In this problem, you are asked to provide a non-technical explanation of how to interpret the dimensionality reduction output, assuming that your audience is familiar with their data but not statistical methodology.
Part (a)
You have run UMAP on the data set, which includes 50 different measurements for each row. You plot the first two dimensions as a scatter-plot. Explain to your audience what the embedding represents.
Scenario: Analyzing play techniques among players using 50 different features
Analysis (Layman terms):
When I plot the first two dimensions of the UMAP embedding and single out any two points in the scatter plot, I will see one of the following scenarios:
1. The points are close to each other
When this is the case, the two players represented by these points have similar game-play tactics per the 50 recorded measurements.
2. The points are far away from each other
In this case, the two players employ different game-play tactics per the measurements, i.e., they approach the game quite differently. In the example of basketball, one player might employ offensive tactics to perform well, while the other’s strengths might lie in defensive tactics.
Part (b)
Though they have not heard of UMAP, your audience has previously encountered principal components analysis for dimensionality reduction. Help your audience understand UMAP by comparing and contrasting it with PCA.
If my audience is versed in the concept of principal component analysis, then I will use the following analogy to explain UMAP. When we use PCA on the players’ game-play tactics, we can highlight global trends and patterns among the players, like the player who is best at defense and the player who is best at 3-pointers; it puts a spotlight on the top performers. UMAP, in contrast, provides a “magnifying lens” to identify local similarities and correlations between the features of the players. It helps connect the skills of the players to show where they converge and diverge.